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The onset of several silent, chronic diseases such as diabetes can 
be detected only through diagnostic tests. Due to cost considerations, 
self-reported outcomes are routinely collected in lieu of expensive di¬ 
agnostic tests in large-scale prospective investigations such as the 
Women’s Health Initiative. However, self-reported outcomes are sub¬ 
ject to imperfect sensitivity and specificity. Using a semiparametric 
likelihood-based approach, we present time to event models to esti¬ 
mate the association of one or more covariates with a error-prone, 
self-reported outcome. We present simulation studies to assess the 
effect of error in self-reported outcomes with regard to bias in the 
estimation of the regression parameter of interest. We apply the pro¬ 
posed methods to prospective data from 152,830 women enrolled in 
the Women’s Health Initiative to evaluate the effect of statin use with 
the risk of incident diabetes mellitus among postmenopausal women. 

The current analysis is based on follow-up through 2010, with a me¬ 
dian duration of follow-up of 12.1 years. The methods proposed in 
this paper are readily implemented using our freely available R soft¬ 
ware package icensmis, which is available at the Comprehensive R 
Archive Network (CRAN) website. 

1. Introduction. The onset of several chronic diseases such as diabetes 
are asymptomatic and can be detected only through diagnostic tests. For ex¬ 
ample, diabetes can be detected by measuring levels of fasting blood glucose 
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or glycosylated hemoglobin levels (HbAlc). However, the costs of such gold 
standard diagnostic tests can be prohibitive in large-scale epidemiological 
studies such as the Women’s Health Initiative (WHI) that enroll and fol¬ 
low over a hundred thousand subjects. Disease prevalence and incidence in 
large observational cohorts are often ascertained through error-prone, self- 
reported questionnaires. In this paper, we propose a semiparametric model 
to assess the association of specific covariates of interest with a silent time 
to event outcome that is assessed through periodic, error-prone self-reports. 

Using data from postmenopausal women enrolled in the WHI, the moti¬ 
vating application in this paper is the evaluation of the hypothesis that the 
use of cholesterol lowering medications (statins) can result in an increased 
risk of diabetes. The WHI recruited women {N = 161,808) aged 50-79 at 
40 clinical centers across the U.S. from 1993-1998 with ongoing follow-up 
[Anderson et al. (1998)]. Prevalent and incident diabetes during the course 
of follow-up was ascertained by self-report obtained at each annual visit. In 
a recent paper. Culver et al. (2012) presented an analysis of the effects of 
statin use on the risk of incident diabetes in the WHI using Cox proportional 
hazards models. The analyses were conducted based on the assumption that 
self-reported outcomes of prevalent and incident diabetes are error-free. The 
validity of self-reports of incident and prevalent diabetes have been evalu¬ 
ated using data from a substudy nested within the WHI—when compared 
to fasting glucose levels (treated as the gold standard), diabetes self-reports 
had a positive predictive value of 74% and negative predictive value of 97% 
[Margolis et al. (2008), Jackson et al. (2014)]. Other studies such as the 
Nurses’ Health Study, Physicians’ Health Study and the Finnish Public Sec¬ 
tor Study also commonly use self-reported outcomes [He et al. (2010), Hu 
et al. (2001), Oksanen et al. (2010)]. 

When a perfect diagnostic test is given sequentially at different points in 
time to the same individual, the time until the event of interest can be deter¬ 
mined to lie in the interval between the last negative test and the first pos¬ 
itive test—that is, the time until the event is interval censored. In this con¬ 
text, methods for estimating the survival distribution and assessing the ef¬ 
fect of covariates have been developed [Turnbull (1976), Finkelstein (1986)]. 
However, when error-prone diagnostic procedures such as self-reports are 
used, standard methods for interval censored outcomes are rendered invalid. 
Previous work in this area includes methods for error-prone outcomes with 
application to data collected from laboratory-based diagnostic tests in stud¬ 
ies in HIV, HPV and STD [Balasubramanian and Lagakos (2001, 2003), 
McKeown and Jewell (2010), Meier, Richardson and Hughes (2003)]. Bala¬ 
subramanian and Lagakos (2003) developed a formal likelihood framework 
to estimate the distribution of the time to mother to child transmission of 
HIV. The proposed methods were applied to data from imperfect DNA PCR 
diagnostic tests to detect the presence of HIV in infants who were born to 
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HIV-positive pregnant women. Meier, Richardson and Hughes (2003) ex¬ 
tended the discrete proportional hazard model to incorporate outcomes and 
covariates. In related work, several papers proposed generalized Cox models 
in settings involving time to event outcomes with incomplete event adjudica¬ 
tion [Snapinn (1998), Cook (2000), Cook and Kosorok (2004)]. Other related 
work includes that proposed by McKeown and Jewell (2010) in the context 
of HPV studies, where the authors accommodate misclassification by incor¬ 
porating ideas of binary generalized linear models with outcomes subject to 
misclassification [Neuhaus (1999)]. The problem of error-prone time to event 
outcomes can also be handled through the Hidden Markov Model (HMM) 
framework. Previous applications of HMM-based methods include the areas 
of breast cancer [Chen, Duffy and Tabar (1996)], HIV [Satten and Longini 
(1996), Guihenneuc-Jouyaux, Richardson and Longini (2000)], lung trans¬ 
plantation [Jackson and Sharpies (2002)] and cervical smear tests [Kirby and 
Spiegelhalter (1994)]. Jackson et al. (2003) present a general framework for 
staged Markov models to handle misclassification due to error-prone screen¬ 
ing tests. Other recent methodological advances within the general area of 
outcomes measured with error include the papers by Garcfa-Zattera et al. 
(2012) and Lyles et al. (2011), as well as works on covariate measurement 
error with application to the WHI and the Nurses Health Study [Shaw and 
Prentice (2012), Spiegelman, Rosner and Logan (2000)]. However, none of 
the previous literature specifically considers error-prone, self-reported time 
to event outcomes. 

In this paper we present a likelihood-based approach to incorporate time- 
varying covariate effects specihc to the setting in which the prevalence and 
incidence (time to event) of a chronic condition such as diabetes is ascer¬ 
tained through error-prone self-reports. We incorporate the situation where 
an unknown proportion of subjects who have already experienced the event 
of interest at baseline are mistakenly included into the study, due to the use 
of error-prone self-reports at study entry. We also provide a freely available 
R software package and illustrate its use [Gu and Balasubramanian (2013)]. 
In Section 2 we present notation, form of the likelihood function, address 
issues related to estimation and extensions to incorporate misclassihcation 
of subjects at study entry. In Section 3 we perform simulation studies to 
evaluate the effects of various degrees of error in self-reports. We investigate 
the effects of erroneous inclusion of subjects who have already experienced 
the event of interest due to imperfect negative predictive values associated 
with self-reports. In Section 4 we evaluate the association between statin 
use with the risk of incident diabetes in a subset of 152,830 women enrolled 
in the WHI. Last, in Section 5 we discuss the findings of this study and 
highlight future directions. 
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2. Methods. In this section we present notation, form of the likelihood 
and extensions to incorporate the possibility of misclassihcation at study 
entry. 

2.1. Notation, likelihood, estimation. Let X refer to the random variable 
denoting the unobserved time to event for an individual, with associated 
survival, density and hazard functions denoted by S{x),f{x) and A(a:), for 
X > 0, respectively. The time origin is set to 0, corresponding to the base¬ 
line visit at which all subjects enrolled in the study are event-free. In other 
words, Pr(AL > 0) = 1. Without loss of generality, we set X = oo when the 
event of interest does not occur. Let N denote the number of subjects and 
Ui denote the number of visits for the ith subject. At each visit, we assume 
that each subject would self-report their disease status. For example, in the 
WHI, information on incident diabetes was collected at periodically sched¬ 
uled visits using self-reported questionnaires. For the ith subject, we let Rj 
and tj denote the 1 x n* vectors of self-reported, binary outcomes and corre¬ 
sponding visit times, respectively. In particular, Rij is equal to 1 if the jth 
self-report for the ith subject is positive (indicating occurrence of the event 
of interest such as diabetes) and 0 otherwise. We assume that self-reports 
are collected at prescheduled visits up to the time of the hrst positive self- 
report, thus, the vectors of test results (Ri), visit times (tj) and the number 
of self-reports collected per subject (n,) are random. Let denote 

the distinct, ordered visit times in the data set among N subjects, where 
0 = To < Ti < • • • < rj < Tj+i = oo, thus, the time axis can be divided into 
J -I- 1 disjoint intervals, [0,ri), [ri,r 2 ),..., [rj, cx)). 

The joint probability of the observed data for the ith subject can be 
expressed as 

J+i 

g(Ri,ti,ni) = ^Pr(rj_i < Xi < Tj) Pr(Rj, tj, ni|rj_i < W < tj) 
i=i 
J+i 

= ^6»jPr(Rj,tj,ni|Tj_i < Xi < tj), 

i=i 

where 9j = Pr(rj_i < X < Tj), tq = 0 and rj+i = oo. 

To simplify the form of the expression above, we make the assumption 
that given the true time of event Xi, an individual’s self-reports are 
independent. That is, 

Ui 

Pr(Rj|Ai,ti) = Pr(rjfc|Ai,Lfc). 
k=l 

This assumption implies that the observed values of other self-reported out¬ 
comes do not provide additional information about the distribution of a 
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particular self-reported outcome from that provided by the actual time of 
the event. 

Based on the derivation in Balasubramanian and Lagakos (2003), it can 
be shown that the joint probability of the observed data for the ith subject 
can be simplihed as 

J+i 

, ti, Uj) — ^ ^ dj 

(2.1) 

J+i 

where Qj = [YlkUP^{rik\Tj-i < Xi <Tj,tk)]- We assume that the proba¬ 
bility of a positive self-report at the kth. visit {rik = 1) conditional on the 
interval containing the true event time and visit time can be expressed as 

Pr(r,fc = l|r,_i < W < t,-,4) = | | 

Here, cpi and <^0 denote the sensitivity and specificity of self-reports, re¬ 
spectively. Thus, the terms Cij, for j = 1,..., J -t- 1, in equation (2.1) can 
be expressed as a product involving the constants ipi and ipQ. Thus, in the 
absence of covariates, the log likelihood for a random sample of N subjects 
can be expressed as 

N /J+1 

(2.2) m = log(L(0)) = ^log ^ a.Oj 

i=l \j=l 

For the special case where self-reports are perfect {(pi = 999 = 1), equation 

(2.2) reduces to the nonparametric likelihood for interval censored observa¬ 
tions given in Turnbull (1976). 

In most settings, including the WHI, it is of interest to evaluate the as¬ 
sociation of a vector of covariates with respect to the time to event of in¬ 
terest. Let Z denote the P x 1 vector of explanatory variables with the 
corresponding P x 1 vector of regression coefficients denoted by /3. To incor¬ 
porate the effect of covariates, we assume the proportional hazards model, 
A(t|Z = z) = Ao(t)e^^l^, or, equivalently, ^(tlZ = z) = 5o(t)'^^ ^■ 

To derive the form of the log-likelihood based on the assumption of the 
proportional hazards model, we first reparameterize the log likelihood in 

(2.2) in terms of the survival function, S = (1 = Si, S 2 , ■ ■ ■, , where 

Sj = Pr(AL > Tj-i). Since Sj = Yli=j^ the vector of interval probabilities 
can be expressed as 0 = Pr-S, where is the (J-|-1) x (J-|-1) transformation 
matrix. Let C = [Cij] denote the x (J -|- 1) matrix of the coefficients, Cij, 


rii 

n Pr(rife|rj_i <Xi< Tj,tk) 

.k=l 
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and let the N x (J + l) matrix D be defined as Dnx{j+i) = C x T^. Then, the 
log-likelihood function for the one-sample setting in (2.2) can be expressed 
as 

N /J+l 

(2.3) ^(S) = j;iog 

i=i \j=i 

where (Si = 1 and 82 , 83 ,, 8 j+i are the unknown parameters of interest. 

Let 1 = 81 > 82 > ■ ■ ■ > 8 J+l denote the baseline survival functions (i.e., 
corresponding to Z = 0), evaluated at the left boundaries of the intervals 
[0, Ti), [ri, r 2 ),... , [tj, 00 ). Then, for subject i, with corresponding covariate 

vector Zj, = { 8 jY Thus, the log-likelihood function for a random 
sample of N subjects can be expressed as 

N /J+l 

(2.4) /(S,/3) = ^log 

i=i \j=i 

The elements of the D matrix are functions of the observed data includ¬ 
ing the visit times and corresponding self-reported results, as well as the 
constants (pQ,ipi. Assuming that ipo,^i are known constants, the maximum 
likelihood estimates of the unknown parameters ^i,..., ^p, 82 ,..., 8 jpi can 
be obtained by numerical maximization of the log-likelihood function, sub¬ 
ject to the constraints that 1 > 82 > 83 > ■ ■ ■ > 8 j+i > 0. Statistical infer¬ 
ence regarding the parameters of interest (/3i,..., I 3 p, 82 ,..., <S'j+i) can be 
made by using asymptotic properties of the maximum likelihood estimators 
[Cox and Hinkley (1979)]. The estimated covariance matrix of the maximum 
likelihood estimates can be obtained by inverting the Hessian matrix. Hy¬ 
pothesis tests regarding the unknown parameters can be carried out using 
the likelihood ratio or Wald test. 

2.2. Misclassification at study entry. In this section we incorporate the 
setting in which a self-report of being event (disease)-free at baseline or study 
entry is used as the inclusion criterion. The evaluation of the association 
between statin use and risk of incident diabetes in the WHI was based on all 
women who self-reported to be diabetes-free at baseline [Culver et al. (2012)] . 
However, diabetes self-reports at study entry in the WHI have been found 
to be less than perfect—the study by Margolis et al. (2008) found that the 
negative predictive value of prevalent diabetes at baseline was approximately 
97%, that is, 3% of women who self-reported as being diabetes-free were, in 
fact, diabetic. In this situation, the assumption that 5(0) = 1 is invalid. 

For the ith subject, let Gi denote the baseline binary self-report, where 
Gi = l denotes a self-report indicating that the event of interest has already 
occurred and Gi = 0 denotes otherwise. Similarly, let A denote the true 
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event status at baseline. In other words, Bi = 1 = Xi < 0 and Bi = 0 = Xi > 
0. Consider a subject who has a negative self-report at baseline (i.e., Gi = 0) 
and is thus included in the data set. As before, let the vector of observed 
self-reports for the ith subject be denoted by Rj. Let the negative predictive 
value of self-reports at baseline be denoted by r], that is, Pr(Rj = OjGj = 0) = 
rj. Then the likelihood function for the zth subject can be expressed as 

Li = Pr(Rj,ti,ni|Gi = 0) 

(2.5) =7?Pr(Ri,tj,ni|Ri = 0,Gj = 0) 

-L (1 - r/)Pr(Rj,tj,ni|Ri = l,Gi = 0). 

We assume that subjects who self-report negative {Gi = 0) and are truly neg¬ 
ative for event at baseline {Bi = 0) are a random sample from all subjects 
who are true negative at baseline. Then we have Pr(Rj,tj,ni|Ri = 0,Gj = 
0) = Pr(Rj,tj,=0), which corresponds to the likelihood function de¬ 
rived in Section 2.1. Thus, PT{Ili,ti,ni\Bi = 0,Gi = 0) = Dij{Sj)^ ■ 

Moreover, Pr(Ri,ti,ni|Ri = l,Gj = 0) = 

The likelihood function for the zth subject has the form 

^+1 

L,{(3,S)=vY.D,,{Sjf^^ + {l-v)Da{Sif^^ 

i=i 

( 2 . 6 ) 

J+i 

= E^u(^r"^ 

i=i 

where = Du and D'- = r]Dij for j > 1. Thus, the likelihood function in¬ 
corporating baseline misclassification has the same general form as in equa¬ 
tion (2.4). The likelihood function in equation (2.4) can be obtained as a 
special case when r] = 1 in equation (2.6). 

2.3. Time-varying covariates. We consider the situation where covariate 
values can change with time and are collected at each visit. Let Zij denote 
the p X 1 vector of covariate values for subject i at time Tj. In extending 
the likelihood function [equation (2.4)] to handle time-varying covariates, we 
make the additional assumption that the values of the covariates Zij remain 
constant during the interval [rpTj+i). Let Aj denote the cumulative hazard 
function during the period of [rj,Tj+i) for the subjects in the reference 
group (i.e., Z = 0). Under the model Xzi{t) = Xo{t)ef^^\ the corresponding 
cumulative hazard function during the period [Tj,Tj+i) for subject i is equal 
to Ajexp(zL/3). The survival function at Tj-i can then be expressed as 
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where j = 2,..., J + 1, where = 1. The log-likelihood function can be 
expressed as a function of the derived , 

N /J+1 

Z(S,/3) = J]log 

i=i \j=i 

The log-likelihood function can be optimized with respect to the parameters 
Aq, ..., Aj-i a-iid Pi,. ■. ,Pp subject to constraints Aj > 0. In practice, if a 
snbject has missing visits or missing covariate valnes at some visits, one 
can carry forward the last observation as one approach to impute missing 
covariate values. However, unless the proportion of missing is very small, 
these ad hoc approaches toward handling missing data may result in biased 
estimates of parameters and their associated standard errors. 

2.4. Unknown sensitivity and specificity. Identifiability of the sensitiv¬ 
ity and specificity parameters is closely tied to the study design and the 
paradigm used for determining number and timing of visits (tests). For ex¬ 
ample, in several epidemiological cohorts in which self-reported outcomes of 
chronic diseases snch as diabetes are collected, data collection on the inci¬ 
dence of the condition ceases following the first positive self-report. In snch 
study designs, it is implicitly assumed that self-reports following the first 
positive self-report will be positive with probability 1, thus, subsequent self- 
reports are noninformative. In settings that incorporate an adaptive testing 
paradigm, the form of the likelihood is shown in equation (2.4)— while this 
is a fnnction of the constants (pi,(po that characterize the sensitivity and 
specificity of self-reports, these parameters cannot be estimated jointly with 
the parameters of interest, namely, fii,..., fip, S 2 , ■ ■ ■, Sj+i. If the sensitiv¬ 
ity and specificity parameters are unknown, an augmented study design in 
which a subset of subjects are given a perfect diagnostic test in addition to 
self-reported questionnaires conld be considered. In these studies, the pa¬ 
rameters <pi,(po can be jointly estimated with the unknown parameters of 
interest. A similar approach was proposed by Lyles et al. (2011) for mismea- 
sured outcomes in logistic regression models. 

In other clinical settings, the mismeasured outcome arises from laboratory- 
based diagnostic tests characterized by imperfect sensitivity and specificity. 
When the testing paradigm involves giving the diagnostic test according to 
a predetermined testing schedule, the form of the likelihood can be shown to 
be identical to that in equation (2.4) [Balasubramanian and Lagakos (2003)]. 
In this case, it is possible to observe seemingly inconsistent patterns of test 
results where one or more negative test results follow a positive result. Ex¬ 
amples include data collected from DNA PCR assays to detect HIV infection 
in infants in pediatric HIV clinical trials. Studies in which subjects are tested 
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according to a predetermined testing schedule, the sensitivity, specificity pa¬ 
rameters can be jointly estimated with the unknown parameters of 

interest [Meier, Richardson and Hughes (2003)]. 

3. Simulation. In this section we present results from simulation studies 
to illustrate the effects of (1) error-prone self-reported outcomes; and (2) 
misclassification at study entry. We present the effects of these factors with 
regard to the bias associated with the estimated regression parameter of 
interest. 

3.1. Effects of error-prone self-reported outcomes. We present average 
results from 1000 simulated data sets in which 1000 subjects were randomly 
assigned to two exposure groups with equal proportion, assuming all sub¬ 
jects were event-free at baseline (i.e., W > 0 for all i). We assumed that there 
is a single binary covariate of interest Zi, corresponding to the exposure sta¬ 
tus of the ith subject. The associated regression parameter in the likelihood 
[equation (2.4)] was set to /3 = 1. For each subject, self-reported question¬ 
naires were collected at 8 scheduled visits over a duration of 8 years, each 
with a random missing probability of 30%. All self-reports following the first 
positive report were assumed to be positive with probability 1. The simula¬ 
tion mechanism assumed that the time to the event of interest X followed an 
exponential distribution. The hazard rate A governing the time to the event 
of interest in the reference group {Zi = 0) was set to equal 0.0132 or 0.0866, 
corresponding to a cumulative incidence by study end (1 — Sj+i) of 0.10 or 
0.50, respectively. As shown in Table 1, we compare results across several 
sets of values for the parameters corresponding to the sensitivity 

and specificity of self-reports. 

In Table 1, for each parameter setting, we present estimates of bias, associ¬ 
ated standard error, root mean square error (RMSE) and coverage probabil¬ 
ity associated with the estimation of (3. Coverage probability was calculated 
as the proportion of data sets in which the 95% confidence interval for /3 
contains its true value. We compare results from two sets of analyses for 
estimating (3: (a) maximizing the likelihood presented in equation (2.4), as¬ 
suming that the true values of are known; and (b) maximizing the 

likelihood presented in equation (2.4), assuming that self-reports are perfect 
(i.e., ipi = ifo = 1). In general, when the true values of ipo,<pi are incorpo¬ 
rated into the analysis, the estimates of (3 are nearly unbiased. Similarly, 
the true coverage probability corresponding to a 95% confidence interval is 
close to its nominal value. On the other hand, when self-reports are incor¬ 
rectly assumed to be perfect, the estimates of j3 may be significantly biased, 
especially in settings where is low. When -C 1, early false positive self- 
reports result in significant loss of information due to premature cessation 
of data collection. In this case, coverage probabilities deviated significantly 
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Table 1 

Comparing estimates of the regression parameter ft from an “adjusted” analysis that 
accounts for the error in self-reported outcomes to an “unadjusted” analysis that 
incorrectly assumes that self-reports are perfect 


VI 

(fo 

■Sj-t-i 

Analysis type 

Bias (%) 

Std Err 

RMSE 

Coverage (%) 

0.75 

1.00 

0.90 

Adjusted 

0.3 

0.17 

0.17 

96.8 

0.75 

1.00 

0.90 

Unadjusted 

0.1 

0.17 

0.17 

97.0 

1.00 

0.75 

0.90 

Adjusted 

-6.7 

0.82 

0.82 

93.8 

1.00 

0.75 

0.90 

Unadjusted 

-90.2 

0.07 

0.90 

0.0 

0.61 

0.995 

0.90 

Adjusted 

1.4 

0.21 

0.22 

94.9 

0.61 

0.995 

0.90 

Unadjusted 

-16.4 

0.17 

0.23 

82.9 

0.75 

1.00 

0.50 

Adjusted 

0.1 

0.09 

0.09 

95.1 

0.75 

1.00 

0.50 

Unadjusted 

-1.9 

0.09 

0.09 

93.5 

1.00 

0.75 

0.50 

Adjusted 

0.2 

0.19 

0.19 

94.4 

1.00 

0.75 

0.50 

Unadjusted 

-59.2 

0.07 

0.60 

0.0 

0.61 

0.995 

0.50 

Adjusted 

0.5 

0.09 

0.09 

94.2 

0.61 

0.995 

0.50 

Unadjusted 

-6.9 

0.08 

0.11 

86.7 


from 95%. Last, incorporating the uncertainty in error-prone self-reports 
increases the standard error of the maximum likelihood estimates of [3. 

We note that while the true event times were simulated based on the ex¬ 
ponential distribution, the proposed methods make no distribution assump¬ 
tions. Thus, the performance of the proposed methods does not depend on 
the underlying distributions of the event times. When event times were sim¬ 
ulated based on a Weibull distribution, similar results were observed (results 
available upon request). 

3.2. Effects of misclassification at study entry. In this simulation we in¬ 
corporate the setting in which an error-prone, self-report of being 
event(disease)-free at study entry is used as the inclusion criterion. As be¬ 
fore, let r] denote the negative predictive value of the baseline self-report. 
That is, each subject included in the study has a probability of 1 — ?] of 
having already experienced the event of interest prior to study entry. Each 
simulated data set included 1000 subjects, of whom 1000 x {1 — rj) had al¬ 
ready experienced the event of interest prior to entry into the study (i.e., 
A < 0). The data were simulated as described in Section 3.1, where cpi = 0.61 
and ipo = 0.995. We compare results for various settings by varying the cu¬ 
mulative incidence of the event of interest (1 — Sj+i) to equal 0.10 or 0.50, 
and by varying the value of ij to equal 0.99, 0.96 or 0.93. 

Table 2 presents the simulation results, averaged over 1000 data sets. We 
present results from an “adjusted” model that properly accounts for mis¬ 
classification at baseline based on the likelihood presented in equation (2.6) 
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Table 2 

Comparing estimates of the regression parameter p from an “adjusted” analysis that 
incorporates the possibility of misclassification at baseline to an “unadjusted” analysis 
that incorrectly assumes that all subjects are event-free at study entry or that r]=l. We 
assume that (pi = 0.61 and ipo = 0.995 


Sj-pi 

V 

Analysis type 

Bias (%) 

Std Err 

RMSE 

Coverage (%) 

0.90 

0.99 

Adjusted 

2.6 

0.22 

0.23 

95.0 

0.90 

0.99 

Unadjusted 

-4.5 

0.20 

0.21 

94.1 

0.90 

0.96 

Adjusted 

1.2 

0.24 

0.24 

95.8 

0.90 

0.96 

Unadjusted 

-22.9 

0.17 

0.29 

72.7 

0.90 

0.93 

Adjusted 

0.1 

0.25 

0.25 

95.2 

0.90 

0.93 

Unadjusted 

-36.4 

0.15 

0.40 

36.3 

0.50 

0.99 

Adjusted 

0.0 

0.09 

0.09 

95.2 

0.50 

0.99 

Unadjusted 

-1.5 

0.09 

0.09 

94.1 

0.50 

0.96 

Adjusted 

0.1 

0.10 

0.10 

94.2 

0.50 

0.96 

Unadjusted 

-5.7 

0.09 

0.11 

89.2 

0.50 

0.93 

Adjusted 

0.6 

0.10 

0.10 

94.1 

0.50 

0.93 

Unadjusted 

-9.4 

0.09 

0.13 

80.9 


compared to the model in equation (2.4) that incorrectly assumes that r] = 1 
(denoted “unadjusted”). In both models, the true values of the sensitivity 
and specificity are assumed. As expected, the adjusted model is nearly unbi¬ 
ased and has uniformly lower bias when compared to the unadjusted model. 
The bias of the unadjusted model increases with decreasing values of neg¬ 
ative predictive value (r/), and it is more pronounced when the cumulative 
incidence is low (1 — Sj+i = 0.10). In general, the inclusion of subjects who 
have already experienced the event of interest at study entry results in the 
exposure groups becoming less distinguishable. Thus, ignoring this issue in 
data analysis results in estimates of exposure effects (/3) that are biased 
toward the null. In contrast, incorporating the effect of baseline misclassi¬ 
fication increases the standard error of (5. The effects on the bias and the 
standard error of fp are reflected in the RMSE values—the adjusted model 
has smaller RMSE than the unadjusted model in all settings except when 
S'j+i = 0.9 and ij = 0.99. The coverage probability of the adjusted model 
is approximately 95% in all settings considered in this study. However, the 
coverage probability of the unadjusted model decreases with decreasing neg¬ 
ative predictive value ( 77 ) due to increased bias. 

4. Application: Risk of diabetes mellitus with statin nse in the Women’s 
Health Initiative. 

Background. We analyze data collected on 152,830 women from the 
Women’s Health Initiative (WHI) to evaluate the effects of statin use on 
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the risk of incident diabetes mellitus (DM). Culver et al. (2012) reported an 
increased risk of incident DM with baseline statin use (multivariate-adjusted 
HR, 1.48; 95% Cl, 1.38-1.59). These results were based on Cox proportional 
hazards models where the time to event variable was calculated as the in¬ 
terval between enrollment date and the earliest of the following: (1) date 
of annual medical history update when new diabetes is self-reported (posi¬ 
tive outcome); (2) date of last annual medical update during which diabetes 
status can be ascertained (censorship); or (3) date of death (censorship). 
The methods used in Culver et al. (2012) were based on the assumptions 
that: (1) all subjects who self-reported as being diabetes-free at baseline 
were truly not diabetic (i.e., ?] = 1); and (2) the self-reports of incident dia¬ 
betes at each follow-up visit were error-free (i.e., ^pi = (pQ = 1). We compare 
the results from Culver et al. (2012) to results based on application of the 
likelihood-based methods described in this paper. 

Diabetes self-reports. Prevalent diabetes at baseline and incident dia¬ 
betes were assessed through self-reported questionnaires in the WHI. At 
baseline and at each annual visit, participants were asked whether she has 
ever received a physician diagnosis of and/or treatment for diabetes when 
not pregnant since the time of the last self-report (visit). Using data from 
a WHI substudy, estimates of sensitivity, specificity and baseline negative 
predictive value of self-reported diabetes outcomes were obtained by com¬ 
paring self-reported outcomes to fasting glucose levels and medication data 
[Margolis et al. (2008)]. A woman was considered to be truly diabetic if 
she had either taken anti-diabetic medication and/or had a fasting glucose 
level >126 mg/dl. From a representative subset of 5485 women, with in¬ 
formation at baseline on self-reported diabetes, fasting glucose levels and 
medication inventory, we estimated that self-reports have a sensitivity of 
0.61, the specificity of 0.995, and a negative predictive value of 0.96 at base¬ 
line. These estimated parameter values are used in our analysis. We used the 
following definitions: (1) sensitivity: proportion of diabetics with a positive 
self-report; (2) specificity: proportion of nondiabetics with a negative self- 
report; and (3) negative predictive value: proportion of subjects who were 
diabetes-free among those with a negative self-report. In practice, estimat¬ 
ing measurement error parameters from validation studies should proceed 
with caution as validation studies may differ from their study populations. 

Methods. The analysis data set included 152,830 women out of a total 
of 161,808 women enrolled in the WHI. Women who self-reported diabetes 
at baseline or those who ever took Cerivastatin were excluded. In addition, 
women with missing data at baseline on diabetes status or medication in¬ 
ventory were excluded [Culver et al. (2012)]. The results presented here are 
based on follow-up until 2010. The median duration of follow-up was 12.1 
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years, including 1,688,967 person-years of total follow-up. During the course 
of follow-up, 10.4% of women self-reported being diagnosed with diabetes. 
Information on statin use was obtained from medical inventory information, 
which was available for selected follow-up years. Information on statin use 
was available for 152,830, 59,505, 128,507, 55,043 and 12,039 subjects at 
baseline, years 1, 3, 6 and 9, respectively. Models included either baseline 
statin use or statin use as a time-varying covariate—in the latter case, the 
most recent medication inventory data available was carried forward for time 
points at which current medication use was not collected. In multivariable 
models, other covariates included race, smoking status, alcohol intake, age, 
education, WHI study, BMI, recreational physical activity, dietary energy 
intake, family history of diabetes and hormone therapy use [Culver et al. 
(2012)]. We assumed that self-reports following the first report of incident 
diabetes are noninformative. Annual visit times were rounded to the nearest 
year in order to limit the number of parameters estimated to describe the 
baseline survival function {S 2 , ■ ■ ■, Sj+i). 

Results. Table 3 presents the estimated hazard ratio (95% confidence in¬ 
terval) for statin use by modeling statin use at baseline or as a time-varying 
covariate. For each, we present results from univariable models as well as 
multivariable models incorporating potential confounders. In each setting, 
the results from the methods proposed in this paper are compared to results 
from Cox models. In all models, by incorporating the imperfect sensitivity 
and specificity of self-reports and the potential misclassification at study en¬ 
try, the hazard ratio of statin use is consistently increased when comparing 
to the corresponding Cox models. Using the proposed methods in equation 


Table 3 

Analysis of the effects of statin use on incident diabetes mellitus risk in the WHI 


Statin variable type 

Type of analysis 

Uni variable/ 
mnltivariable* 

N 

Hazard ratio 
(95% Cl) 

Baseline statin 

Proposed model 

Uni variable 

152,830 

2.33 (2.12, 2.56) 

Baseline statin 

Proposed model 

Multivariable 

138,338 

1.81 (1.65, 1.99) 

Baseline statin 

Cox model 

Uni variable 

152,830 

1.69 (1.60, 1.78) 

Baseline statin 

Cox model 

Multivariable 

138,338 

1.54 (1.46, 1.63) 

Time-varying statin 

Proposed model 

Uni variable 

152,830 

2.49 (2.31, 2.68) 

Time-varying statin 

Proposed model 

Multivariable 

138,338 

1.88 (1.75, 2.02) 

Time-varying statin 

Cox model 

Uni variable 

152,830 

1.65 (1.59, 1.72) 

Time-varying statin 

Cox model 

Multivariable 

138,338 

1.48 (1.42, 1.54) 


* Covariates adjusted include race, smoking status, alcohol intake, age, education, WHI 
study, BMI, recreational physical activity, dietary energy intake, family history of diabetes 
and hormone therapy use. 
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(2.6) , the hazard ratio for baseline statin use from univariate analysis was 
2.33 (95% Cl: 2.12-2.56). In the multivariable model, the hazard ratio of 
baseline statin use was 1.81 (95% Cl: 1.65-1.99), suggesting a relatively 
strong confounding effect. When statin use was modeled as a time-varying 
covariate, the hazard ratios of statin use from univariate and multivariate 
models were 2.49 (95% Cl: 2.31-2.68) and 1.88 (95% Cl: 1.75-2.02), respec¬ 
tively. 

The goodness of fit of the multivariable model incorporating statin use as 
a time-varying covariate was assessed in an augmented model that included 2 
additional terms corresponding to the interactions of time periods (in years) 

(3.6] and (6,16] with statin use. This model allows the effect of statin use to 
vary between the time periods (0,3], (3,6] and (6,16] years. The Wald test p 
values corresponding to the interactions of statin use with the time periods 

(3,6] and (6,16] were 0.89 and 0.11, respectively; these results indicate that 
the augmented model provided no improvement in fit when compared to the 
model without the additional interaction terms. 

To evaluate how the results depend on the choice of parameters such as 
sensitivity, specificity and baseline negative predictive value of self-reported 
diabetes, we performed a sensitivity analysis by varying each of these param¬ 
eters. Table 4 presents how the estimated hazard ratio of statin use changes 
with different combinations of the parameters. Statin use was modeled as 
a time-varying covariate while simultaneously adjusting for potential con- 
founders. We observed that the estimated hazard ratio of statin use is most 
sensitive to change in specificity. This is largely due to the fact that the 
cumulative incidence of diabetes was low (10.4%), and thus false positive 
self-reports due to imperfect specificity have a big influence on estimated 
parameters. In general, the hazard ratio of statin use decreases as specificity 
increases. Changes in sensitivity and negative predictive value at baseline 
have modest effects on the resulting model ht. 

The models presented here can be implemented using our freely available 
R software package icensmis [Gu and Balasubramanian (2013)] as described 
in the supplemental material [Gu, Ma and Balasubramanian (2015)]. 

5. Discussion. Due to cost considerations, the use of self-reported out¬ 
comes is common to diagnose prevalent and incident disease in large-scale 
epidemiologic investigations. In this paper we present a likelihood-based 
framework to model the association of a time-varying covariate with a time 
to event outcome, that is observed through periodically collected, error- 
prone, self-reported data. We incorporate the possibility of erroneous inclu¬ 
sion of subjects who have already experienced the event of interest prior 
to study entry as a result of the use of self-reported outcomes at baseline 
in determining the study population. R code for implementing the mod¬ 
els proposed here are presented in the supplemental material [Gu, Ma and 
Balasubramanian (2015)]. 
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Table 4 

Statin use versus risk of incident diabetes mellitus in the WHI—sensitivity analysis for 
varying sensitivity (ifii), specificity (ipo) and baseline negative predictive value (rj) 
associated with diabetes self-reports. All models incorporate statin use as a time-varying 
covariate and adjust for potential confounders 


Sensitivity {(pi) 

Specificity (v^o) 

Negative predictive 
value (rj) 

Hazard ratio 
(95% Cl) 

0.50 

0.993 

0.96 

2.11 (1.92, 2.31) 

0.50 

0.993 

0.98 

2.10 (1.92, 2.30) 

0.50 

0.995 

0.96 

1.93 (1.79, 2.08) 

0.50 

0.995 

0.98 

1.93 (1.79, 2.07) 

0.50 

0.997 

0.96 

1.76 (1.65, 1.88) 

0.50 

0.997 

0.98 

1.77 (1.66, 1.88) 

0.61 

0.993 

0.96 

2.05 (1.88, 2.24) 

0.61 

0.993 

0.98 

2.06 (1.89, 2.24) 

0.61 

0.995 

0.96 

1.88 (1.75, 2.02) 

0.61 

0.995 

0.98 

1.89 (1.76, 2.03) 

0.61 

0.997 

0.96 

1.73 (1.63, 1.84) 

0.61 

0.997 

0.98 

1.74 (1.64, 1.84) 

0.70 

0.993 

0.96 

2.02 (1.85, 2.20) 

0.70 

0.993 

0.98 

2.03 (1.86, 2.21) 

0.70 

0.995 

0.96 

1.86 (1.73, 2.00) 

0.70 

0.995 

0.98 

1.87 (1.74, 2.00) 

0.70 

0.997 

0.96 

1.71 (1.61, 1.82) 

0.70 

0.997 

0.98 

1.72 (1.62, 1.82) 


We presented results from simulation studies to assess the impact of ignor¬ 
ing error in self-reported outcomes—in all cases considered, the use of sta¬ 
tistical models that correctly accommodate the error inherent in self-reports 
resulted in nearly unbiased estimates of the regression parameter of interest. 
The largest bias as a result of ignoring error in self-reported outcomes was 
found in settings where the cumulative incidence was low and specificity was 
less than perfect. Models that correctly accommodate error in self-reports 
also resulted in increased variance of the estimated regression parameters. 
However, in most settings, the RMSE values that combine the impact of 
bias and variance of the estimated regression parameter favored the use of 
methods that appropriately account for error in self-reported outcomes. 

The methods proposed in this paper were applied to prospective data 
from 152,830 women enrolled in the WHI to evaluate the effect of statin use 
and risk of incident diabetes. By accounting for the imperfect sensitivity, 
specificity and negative predictive value at baseline for diabetes self-reports, 
we observed that the hazard ratio for statin use was significantly larger 
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than that estimated in naive analyses that ignored the error in self-reported 
outcomes. In particular, the hazard ratio of statin use in a multivariable 
model adjusted for potential confounders was 1.88 (95% Cl: 1.75-2.02) as 
compared to the multivariable hazard ratio estimate from Cox model 1.48 
(95% Cl: 1.42-1.54). 

In the methods developed here, we assumed that the sensitivity and speci¬ 
ficity of self-reported outcomes are invariant with respect to time since entry 
and independent of covariates. In many real-world settings, this assumption 
may result in over-simplified models, particularly in applications in which 
visits are unequally spaced. In addition, the methods developed here as¬ 
sumed that the parameters governing the characteristics of self-reported 
outcomes are known. However, in many cases these are estimated values— 
in this context, it would be useful to extend the methods proposed here to 
consider study designs including validation subsets that would allow joint es¬ 
timation of the sensitivity and specificity of self-reported outcomes together 
with the other parameters of interest. 

Acknowledgments. We are grateful to the Editors and referees for their 
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SUPPLEMENTARY MATERIAL 

Tutorial for using the R package icensmis (DOI: 10.1214/15-AOAS810SUPP 
.pdf). We present a short tutorial using the R package icensmis to perform 
the analysis described in this paper. 
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